Open Source Corpus Analysis Tools for Malay
Abstract
Tokenisers, lemmatisers and POS taggers are vital to the linguistic and digital furtherment of any language. In this paper, we present an open source toolkit for Malay incorporating a word and sentence tokeniser, a lemmatiser and a partial POS tagger, based on heavy reuse of pre-existing language resources. We outline the software architecture of each component, and present an evaluation of each over a 26K word sample of Malay text.
Similar resources
A Basic Language Resource Kit for Persian
Persian with its about 100,000,000 speakers in the world belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exp...
Praaline: Integrating Tools for Speech Corpus Research
This paper presents Praaline, an open-source software system for managing, annotating, analysing and visualising speech corpora. Researchers working with speech corpora are often faced with multiple tools and formats, and they need to work with ever-increasing amounts of data in a collaborative way. Praaline integrates and extends existing time-proven tools for spoken corpora analysis (Praat, S...
A cross-cultural study of request speech act: Iraqi and Malay students
Several studies have indicated that the range and linguistics expressions of external modifiers available in one language differ from those available in another language. The present study aims to investigate the cross-cultural differences and similarities with regards to the realization of request external modifications. To this end, 30 Iraqi and 30 Malay u...
An Exploratory Study of the Malay Text Processing Tools in Ontology Learning
This paper discusses the overall process of learning taxonomy from Malay texts using unsupervised conceptual clustering approach and investigates the existing Malay NLP tools as potential pre-processing tools for the proposed ontology learning approach. The tools are a maximum-entropy parser based on open NLP package, a word sense tagger and a parser based on pola grammar. A case study approach...
Preparation of MaDiTS corpus for Malay dialect translation and speech synthesis system
This paper presents our work in acquiring a Malay dialect translation and speech synthesis corpus. In this study, an architecture of speech corpus acquisition, which including Malay dialect translation and Malay dialect grapheme to phoneme (G2P), was proposed. The pronunciation dictionary for dialectal Malay was generated through G2P tool. As dialectal Malay is considered as scarce resource, di...